
    Topic Identification for Speech without ASR

    Modern topic identification (topic ID) systems for speech use automatic speech recognition (ASR) to produce speech transcripts, and perform supervised classification on such ASR outputs. However, under resource-limited conditions, the manually transcribed speech required to develop standard ASR systems can be severely limited or unavailable. In this paper, we investigate alternative unsupervised solutions to obtaining tokenizations of speech in terms of a vocabulary of automatically discovered word-like or phoneme-like units, without depending on the supervised training of ASR systems. Moreover, using automatic phoneme-like tokenizations, we demonstrate that a convolutional neural network based framework for learning spoken document representations provides competitive performance compared to a standard bag-of-words representation, as evidenced by comprehensive topic ID evaluations on both single-label and multi-label classification tasks.
    Comment: 5 pages, 2 figures; accepted for publication at Interspeech 201
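    The bag-of-words baseline this abstract compares against can be sketched in a few lines: represent each spoken document as counts over its (automatically discovered) unit vocabulary, then classify by similarity to per-topic centroids. All unit labels, topics, and data below are invented for illustration; this is a minimal sketch of the idea, not the paper's system.

```python
from collections import Counter
import math

# Hypothetical "documents": each is a sequence of automatically discovered
# phoneme-like unit labels (no ASR transcript involved).
train_docs = {
    "weather": [["u3", "u7", "u7", "u1"], ["u7", "u3", "u3"]],
    "sports":  [["u9", "u2", "u9", "u5"], ["u2", "u9", "u9"]],
}

def bow(tokens):
    """Bag-of-words (here: bag-of-units) count vector."""
    return Counter(tokens)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One centroid (summed unit counts) per topic.
centroids = {
    topic: sum((bow(d) for d in docs), Counter())
    for topic, docs in train_docs.items()
}

def classify(tokens):
    """Assign the topic whose centroid is most similar to the document."""
    v = bow(tokens)
    return max(centroids, key=lambda t: cosine(v, centroids[t]))

print(classify(["u7", "u7", "u3"]))  # closest to the "weather" centroid
```

    A real system would use TF-IDF weighting and a trained classifier rather than nearest-centroid cosine similarity, but the document representation is the same.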

    The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines

    The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing, and machine learning. This paper introduces the 5th CHiME Challenge, which considers the task of distant multi-microphone conversational ASR in real home environments. Speech material was elicited using a dinner party scenario, with efforts taken to capture data that is representative of natural conversational speech, and recorded by 6 Kinect microphone arrays and 4 binaural microphone pairs. The challenge features a single-array track and a multiple-array track and, for each track, distinct rankings will be produced for systems focusing on robustness with respect to distant-microphone capture vs. systems attempting to address all aspects of the task including conversational language modeling. We discuss the rationale for the challenge and provide a detailed description of the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR.
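    The ASR baselines mentioned above are scored by word error rate (WER). As a reminder of what that metric computes, here is a minimal, self-contained sketch using the standard Levenshtein alignment over word sequences; the example sentences are invented:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    computed via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("pass the salt please", "pass a salt"))  # 1 sub + 1 del -> 0.5
```

    Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is common in the far-field conditions this challenge targets.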

    Quantifying the value of pronunciation lexicons for keyword search in low resource languages

    ABSTRACT: This paper quantifies the value of pronunciation lexicons in large vocabulary continuous speech recognition (LVCSR) systems that support keyword search (KWS) in low resource languages. State-of-the-art LVCSR and KWS systems are developed for conversational telephone speech in Tagalog, and the baseline lexicon is augmented via three different grapheme-to-phoneme models that yield increasing coverage of a large Tagalog word-list. It is demonstrated that while the increased lexical coverage (or reduced out-of-vocabulary (OOV) rate) leads to only modest (ca. 1%-4%) improvements in word error rate, the concomitant improvements in actual term weighted value are as much as 60%. It is also shown that incorporating the augmented lexicons into the LVCSR system before indexing speech is superior to using them post facto, e.g., for approximate phonetic matching of OOV keywords in pre-indexed lattices. These results underscore the disproportionate importance of automatic lexicon augmentation for KWS in morphologically rich languages, and advocate for using them early in the LVCSR stage.
    Index Terms: Speech Recognition, Keyword Search, Information Retrieval, Morphology, Speech Synthesis

    1. LOW-RESOURCE KEYWORD SEARCH
    Thanks in part to the falling costs of storage and transmission, large volumes of speech such as oral history archives [1, 2] and on-line lectures […]
    We are interested in improving KWS performance in a low resource setting, i.e. where some resources are available to develop an LVCSR system, such as 10 hours of transcribed speech corresponding to about 100K words of transcribed text, and a pronunciation lexicon that covers the words in the training data, but accuracy is sufficiently low that considerable improvement in KWS performance is necessary before the system is usable for searching a speech collection. A fair amount of past research has been devoted to improving the acoustic models from un-transcribed speech […] The importance of pronunciation lexicons for LVCSR has not been entirely overlooked: several papers have addressed the problem of automatically generating pronunciations for out-of-vocabulary (OOV) words […]
    Two notable exceptions to this conventional wisdom are (i) accuracy on infrequent, content-bearing words, which are more likely to be OOV, and (ii) accuracy in morphologically rich languages, e.g. Czech and Turkish. These exceptions come together in a detrimental fashion when developing KWS systems for a morphologically rich, low resource language such as Tagalog. This is the setting in which we will quantify the impact of increasing lexical coverage on the performance of a KWS system. We assume a transcribed corpus of 10 hours of Tagalog conversational telephone speech […]
    We first develop state-of-the-art LVCSR and KWS systems based on the given resources. We process and index a 10 hour search collection using the KWS system, and measure KWS performance using a set of 355 Tagalog queries. We then explore three different methods for augmenting the 5.7K word lexicon to include additional words seen in the larger LM training corpus. The augmented lexicons are used to improve the KWS system in two different ways: reprocessing the speech with the larger lexicon, or using it during keyword search. The efficacy of the augmented lexicons is measured in terms of […]
    (The authors, listed here in alphabetical order, were supported by DARPA BOLT contract Nō HR0011-12-C-0015, and IARPA BABEL contract Nō W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, IARPA, DoD/ARL or the U.S. Government.)
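    The "actual term weighted value" reported in the abstract is the standard NIST KWS metric: per keyword it combines the miss rate with a heavily weighted false-alarm rate (β = 999.9 in the usual setup, with false-alarm probability computed per second of non-target speech). A minimal sketch of the computation, with invented detection counts:

```python
# Term-weighted value (TWV) as used in NIST keyword-search evaluations:
#   TWV = 1 - average over keywords of (P_miss + beta * P_fa)
# where beta = 999.9 and P_fa is computed per second of non-target speech.
BETA = 999.9

def twv(keywords, t_speech):
    """keywords: list of (n_true, n_hits, n_false_alarms) per keyword.
    t_speech: duration of the search collection in seconds."""
    total = 0.0
    for n_true, n_hits, n_fa in keywords:
        p_miss = 1.0 - n_hits / n_true
        # False-alarm opportunities: one per second of speech that does
        # not contain the keyword (the standard approximation).
        p_fa = n_fa / (t_speech - n_true)
        total += p_miss + BETA * p_fa
    return 1.0 - total / len(keywords)

# Two hypothetical keywords over a 3600-second search collection.
score = twv([(10, 8, 1), (5, 5, 0)], t_speech=3600.0)
print(round(score, 3))
```

    The huge β is why a few percent of WER improvement can translate into a 60% TWV improvement: recovering even a handful of previously-OOV keyword hits moves P_miss far more than it costs in false alarms.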

    CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

    Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
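    The diarization module in Track 2 is conventionally scored by diarization error rate (DER): the fraction of reference speech time attributed to the wrong source (missed speech + false-alarm speech + speaker confusion). A minimal frame-level sketch with invented labels, where None marks non-speech; real scoring additionally finds the optimal mapping between reference and hypothesis speaker labels, which this sketch assumes is already done:

```python
def der(ref, hyp):
    """Frame-level diarization error rate.

    ref, hyp: equal-length lists of speaker labels per frame; None = silence.
    DER = (missed speech + false-alarm speech + speaker confusion)
          / total reference speech frames.
    """
    miss = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    fa = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    conf = sum(1 for r, h in zip(ref, hyp)
               if r is not None and h is not None and r != h)
    n_speech = sum(1 for r in ref if r is not None)
    return (miss + fa + conf) / n_speech

ref = ["A", "A", "B", "B", None, "B"]
hyp = ["A", "B", "B", None, None, "B"]
print(der(ref, hyp))  # 1 confusion + 1 miss over 5 speech frames -> 0.4
```

    In the unsegmented setting, diarization errors propagate directly into recognition, which is why the two are evaluated jointly in Track 2.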